Minimalist's Linux Cluster

Chang-Yeong Choi, Jeong-Hyun Kim, and Seyong Kim^a *
^a Department of Physics, Sejong University, Seoul 143-747, Korea

Using barebone PC components and NICs, we construct a Linux cluster which has a 2-dimensional mesh structure.
This cluster has a smaller footprint, is less expensive, and uses less power than a conventional Linux cluster.
Here, we report our experience in building such a machine and discuss our current lattice project on the machine.

1. Motivation

Constructing a Linux cluster from commodity PCs and commodity
networking hardware has become quite easy, and using such a cluster
for a lattice QCD project is very popular [1]. This increases the
range of computing power available to those who have only moderate
means. However, from our experience of using such a cluster, we found
that there is room for improvement in scaling up the current Linux
cluster architecture. First, if many PCs are simply stacked on top of
each other in rows, the cluster soon begins to occupy too much
physical space. Second, not all the components in a standard PC are
essential for a lattice simulation. By getting rid of unnecessary
parts, one may reduce the overall cost and the electrical power
requirement of each PC. Third, a switched ethernet hub is usually used
in a Linux cluster, and providing full bisection bandwidth with such
switches is expensive. Building a cluster without a switch may be more
scalable. On the other hand, for those who have limited resources like
us, building everything (in particular, hardware components) from
scratch to alleviate the above problems is not sensible because it
would probably take too long to develop such components. Thus we
looked for a solution which does not require custom-made hardware
components and which, once developed, is reusable in the future so
that evolutionary upgrades do not introduce delays.

Figure 1. Booting sequence



2. Architecture and Hardware 

Each node is an extremely thin node which consists only of an Intel
Pentium IV 2.4GHz CPU, 512 Mbytes of DDR SDRAM, a mother board, and 4
fast ethernet network interface cards (NICs). One of the 4 NICs has a
socket for an EPROM or EEPROM which holds the bootcode. Table 1 shows
the hardware components in the 36-node cluster and their costs. Except
for the chassis and the network cables, everything is an off-the-shelf
component and there is nothing special about them. The chassis is
designed so that each crate accepts any standard ATX-size mother
board, so an upgrade means just replacing the mother boards with new
ones. The current chassis size is suitable for a 64-node configuration
and has room for an additional 28 nodes. One PC with a 360 GByte hard
disk serves as the front end server.

Table 1
Hardware components and prices

component                     unit price ($)   no. of units   net price ($)
Intel P-IV 2.4GHz CPU                  198               36          7,128
ASUS P4-PE mother board                170               36          6,120
512 MB PC2700 DDR SDRAM               93.5               36          3,366
(3+1) RealTek 8139C NIC                 44               36          1,584
180W Sun ATX power supply               21               36            756
network cable                            8               36            288
chassis (200 x 91 x 75 cm)           1,037                1          1,037
total price                  534.5 x 36 + 1,037                   = 20,279
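
The totals quoted in Table 1 can be cross-checked with a few lines of arithmetic. The following is only a sketch of that check; the prices are copied from the table, while the dictionary layout and the function names are our own choices for illustration.

```python
# Unit prices in dollars, copied from Table 1 (items bought once per node).
PER_NODE_PRICES = {
    "Intel P-IV 2.4GHz CPU": 198.0,
    "ASUS P4-PE mother board": 170.0,
    "512 MB PC2700 DDR SDRAM": 93.5,
    "(3+1) RealTek 8139C NIC": 44.0,
    "180W Sun ATX power supply": 21.0,
    "network cable": 8.0,
}
CHASSIS_PRICE = 1037.0  # one chassis for the whole cluster
NODES = 36


def per_node_cost():
    """Sum of all per-node items (the 534.5 in the 'total price' row)."""
    return sum(PER_NODE_PRICES.values())


def total_cost(nodes=NODES):
    """Per-node cost times the node count, plus the single chassis."""
    return per_node_cost() * nodes + CHASSIS_PRICE


if __name__ == "__main__":
    print("per node:", per_node_cost())
    print("total   :", total_cost())
```

Under these numbers the per-node sum and the cluster total reproduce the "total price" row of the table.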

Thus, the development effort for our thin-node cluster mostly involves
setting up the necessary software environment: booting, the OS, and
MPI parallel programming. Since there is no permanent storage device
on any node, booting is a little tricky and the Linux operating system
needs to be configured dynamically after boot. Fortunately, there is a
Linux solution, the "Linux Terminal Server Project" (LTSP) [2], which
was developed for a server-client situation similar to ours: a server
booting a host of diskless client computers. In this scenario, instead
of booting from a kernel image on permanent media such as a hard disk,
floppy disk, or flash memory device, a NIC on the mother board which
carries a small EPROM or EEPROM (for example, 64 Kbytes) performs a
network boot. On power-up, this network card on the client node
executes its bootcode and broadcasts an IP request together with its
MAC address to the local network using the Dynamic Host Configuration
Protocol (DHCP) [3]. The server responds to this DHCP request with the
basic IP information, such as the client node's IP address, netmask
setting, root file directory, and kernel image name, selected
according to the client's MAC address. With the reply from the server,
the client node configures its TCP/IP and fetches the kernel image
from the host computer by the Trivial File Transfer Protocol (TFTP)
[4]. Once the kernel image is loaded into the node's memory, the
kernel starts executing, initializes the client node, and sets it up
for normal operation. One may choose whether application software runs
on the client nodes or on the server node.
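
The boot handshake described above can be summarized as a toy model. The sketch below does no real networking; it only mimics the server-side lookup from a client MAC address to the boot parameters and lists the steps the client then takes. All MAC addresses, IP addresses, paths, and file names are made-up placeholders, and the names `BootReply`, `BOOT_TABLE`, `answer_dhcp_request`, and `boot_steps` are hypothetical, not part of LTSP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootReply:
    """Boot parameters the server hands to a diskless client."""
    ip_address: str
    netmask: str
    root_path: str
    kernel_image: str


# Hypothetical per-MAC table; every value here is a placeholder.
BOOT_TABLE = {
    "00:e0:4c:00:00:01": BootReply("192.168.0.1", "255.255.255.0",
                                   "/opt/ltsp/i386", "vmlinuz-node"),
    "00:e0:4c:00:00:02": BootReply("192.168.0.2", "255.255.255.0",
                                   "/opt/ltsp/i386", "vmlinuz-node"),
}


def answer_dhcp_request(mac):
    """Return the boot parameters for a known MAC address, else None."""
    return BOOT_TABLE.get(mac.lower())


def boot_steps(mac):
    """List the steps a client with this MAC address would go through."""
    reply = answer_dhcp_request(mac)
    if reply is None:
        return [f"DHCP: no entry for {mac}, request ignored"]
    return [
        f"DHCP: {mac} is assigned {reply.ip_address}/{reply.netmask}",
        "client configures TCP/IP from the DHCP reply",
        f"TFTP: client fetches kernel image '{reply.kernel_image}'",
        f"kernel starts and mounts root directory '{reply.root_path}'",
    ]


if __name__ == "__main__":
    for step in boot_steps("00:E0:4C:00:00:01"):
        print(step)
```

Keying the table on the MAC address is what lets one server give each diskless node its own identity even though all nodes run the same bootcode.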

The main difference between the LTSP setup and ours lies in the
assumed network topology. LTSP relies on a star network connection,
whereas our project adopts a 2-dimensional mesh structure. In our
case, each node, once booted, must act as a DHCP server and a TFTP
server for the next client node, in contrast to the LTSP situation in
which a central server controls all the client nodes. This booting
process may progress in parallel to speed it up: the front end server
in our cluster starts the booting process on 6 nodes simultaneously,
then these 6 nodes boot the next 6 nodes, and so on (see Fig. 1).
After the booting procedure is completed, 2-D mesh routing is
achieved by explicit 'route' commands [5] in a script called from the
Linux "init" script. 'route' algorithmically assigns one of the four
ethernet devices, eth0, eth1, eth2, and eth3, depending on the
destination IP address and the logical node ID. Since the size of the
Linux routing table may grow up to 2048 elements just by changing a
kernel compile option, this kind of explicit routing works fine for a
moderate size cluster. Ideally, one would like to have a distributed
routing mechanism implemented at the kernel level, but it is not part
of the current Linux kernel. The Linux distribution used on the
cluster is 'Wow Linux version 7.1', which is equivalent to Red Hat
Linux 7.1, and the kernel version is 2.4.9. The version of the LTSP
package which we modified for our needs is 3.0.5. The MPICH and LAM
implementations of the MPI parallel programming environment are
available on the cluster.
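
To make the explicit routing concrete, the sketch below generates one host route per destination for a given node. It is an illustration under our own assumptions, not the script used on the cluster: we assume a 6 x 6 mesh without wrap-around links, row-major logical node IDs, dimension-order routing (column first, then row), the address scheme 10.0.0.(ID+1), and a fixed mapping of eth0 to eth3 onto the four mesh directions. Only the `route add -host ... dev ...` command form is standard.

```python
ROWS = 6  # assumed mesh height
COLS = 6  # assumed mesh width

# Assumed mapping of mesh directions to ethernet devices.
EAST, WEST, SOUTH, NORTH = "eth0", "eth1", "eth2", "eth3"


def coords(node_id):
    """Row-major logical node ID -> (row, col)."""
    return divmod(node_id, COLS)


def node_ip(node_id):
    """Hypothetical address scheme: node 0 is 10.0.0.1, and so on."""
    return f"10.0.0.{node_id + 1}"


def next_device(src, dst):
    """Device for the first hop from src to dst (column first, then row)."""
    if src == dst:
        raise ValueError("source and destination are the same node")
    src_row, src_col = coords(src)
    dst_row, dst_col = coords(dst)
    if dst_col > src_col:
        return EAST
    if dst_col < src_col:
        return WEST
    if dst_row > src_row:
        return SOUTH
    return NORTH


def route_commands(src):
    """One 'route add -host' line per destination, as seen from src."""
    return [
        f"route add -host {node_ip(dst)} dev {next_device(src, dst)}"
        for dst in range(ROWS * COLS)
        if dst != src
    ]


if __name__ == "__main__":
    for line in route_commands(7)[:8]:
        print(line)
```

With one entry per destination, each node of a 36-node mesh needs 35 host routes, well below the routing table size mentioned above.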

3. Performance and Discussion 

Fig. 2 shows the performance of a hybrid molecular dynamics
simulation code for two flavors of staggered quarks with
m_q a = 0.01 on an 8^3 x 512 lattice (the single-node benchmark is
for an 8^3 x 32 lattice). The code uses a one-dimensional ring layout
of lattice sites (N_t = 512 is distributed over the nodes) and is not
yet optimized for the 2-D mesh structure of the cluster. Nevertheless,
the code performance scales up nicely from 1 to 8 nodes. The sustained
speed is about 2.25 GFLOPS on 8 nodes, which is 11% of the theoretical
peak speed. Thus, our cluster achieved ~0.5 MFLOPS/$ with straight
FORTRAN code and no assembly language subroutines. We find that using
more than 8 nodes with the current full QCD test code quickly degrades
the cluster performance because of the non-optimal communication
pattern of the test code.
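
The price/performance figure can be cross-checked against Table 1. The sketch below assumes that the cost of a node is the cluster total divided by 36 (that is, the chassis is shared evenly); the sustained speed, node count, and fraction of peak are the values quoted in the text, and the function names are ours.

```python
SUSTAINED_GFLOPS = 2.25   # sustained speed on 8 nodes (from the text)
NODES_USED = 8
FRACTION_OF_PEAK = 0.11   # quoted fraction of theoretical peak
TOTAL_PRICE = 20279.0     # dollars, total price from Table 1
TOTAL_NODES = 36


def sustained_mflops_per_node():
    """Sustained speed per node in MFLOPS."""
    return SUSTAINED_GFLOPS * 1000.0 / NODES_USED


def price_per_node():
    """Assumption: the chassis cost is shared evenly among all nodes."""
    return TOTAL_PRICE / TOTAL_NODES


def mflops_per_dollar():
    """Sustained MFLOPS bought per dollar of hardware."""
    return sustained_mflops_per_node() / price_per_node()


def implied_peak_gflops_per_node():
    """Per-node peak speed implied by the quoted 11% efficiency."""
    return SUSTAINED_GFLOPS / FRACTION_OF_PEAK / NODES_USED


if __name__ == "__main__":
    print("MFLOPS per node  :", sustained_mflops_per_node())
    print("dollars per node :", round(price_per_node(), 1))
    print("MFLOPS per dollar:", round(mflops_per_dollar(), 3))
    print("implied peak/node:", round(implied_peak_gflops_per_node(), 2))
```

Under these assumptions a node sustains about 281 MFLOPS for about 563 dollars, which is close to 0.5 MFLOPS per dollar.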

A conventional PC with an Intel Pentium IV CPU is ordinarily equipped
with a 350W power supply. Since we use a 180W power supply for each
node and the cluster operates fine under this condition, the overall
power requirement is reduced to about half that of a usual Linux
cluster. Also, the footprint of our cluster is 200 x 91 x 75 cm, which
is considerably smaller than that of 36 stacked PCs. The space saving
will be even greater for the full 64-node cluster, since the same
chassis will be used. The node cost saved by doing without a hard disk
is ~$100 (~15%). The whole construction is reusable, as we planned,
since the mother board size is the only factor which needs to be
considered in an upgrade, and the standard ATX mother board size will
stay with us for a while.

Figure 2. Performance of the cluster. The horizontal axis is the
number of nodes and the vertical axis is GFLOPS.

Global MPI operations such as "MPI_ALLREDUCE" involve multiple hops in
our cluster. Since LAM and MPICH rely on TCP/IP and each hop
contributes software and hardware latency to message passing,
traversing many nodes reduces the efficiency of a program on our
cluster. However, since the software latency is larger than the
hardware latency, multiple hops will be a less severe problem when
user-space devices such as Infiniband [6] become cheaply available.
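
How many hops a message needs depends on the mesh size. The sketch below counts hops as the Manhattan distance between nodes, assuming a rectangular mesh without wrap-around links, and models the latency of a message as the hop count times a fixed per-hop cost. The function names are ours, and the latency values in the usage example are arbitrary placeholders, not measurements.

```python
from itertools import product


def hop_count(a, b):
    """Number of links between mesh positions a and b (no wrap-around)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mesh_hop_stats(rows, cols):
    """Return (maximum, mean) hop count over all pairs of distinct nodes."""
    if rows * cols < 2:
        raise ValueError("need at least two nodes")
    nodes = list(product(range(rows), range(cols)))
    distances = [hop_count(a, b) for a in nodes for b in nodes if a != b]
    return max(distances), sum(distances) / len(distances)


def message_latency(hops, software_us, hardware_us):
    """Simple model: every hop pays the software and hardware latency."""
    return hops * (software_us + hardware_us)


if __name__ == "__main__":
    for rows, cols in [(6, 6), (8, 8)]:
        worst, mean = mesh_hop_stats(rows, cols)
        print(f"{rows}x{cols} mesh: max {worst} hops, mean {mean:.2f} hops")
    # Placeholder latencies in microseconds, for illustration only.
    print(message_latency(4, software_us=50.0, hardware_us=10.0))
```

In this model, lowering the per-hop software cost shrinks the latency of every multi-hop message proportionally, which is the reason a user-space interconnect would make the hop count matter less.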